Abstract


We present a complementary objective for training recurrent neural networks
(RNNs) with gating units that helps with regularization and interpretability of the
trained model. Attention-based RNN models have shown success in many difficult
sequence-to-sequence classification problems with long- and short-term dependencies;
however, these models are prone to overfitting. In this paper, we describe
how to regularize these models through an L1 penalty on the activation of the gating
units, and show that this technique reduces overfitting on a variety of tasks
while also providing a human-interpretable visualization of the inputs used
by the network. These tasks include sentiment analysis, paraphrase recognition,
and question answering.

1 Introduction 

Attention-based recurrent neural networks (RNNs) have shown great success in a wide range of tasks
such as computer vision, image generation, machine translation, and speech recognition,
and even as controllers for memory addressing and retrieval.

While there is debate as to how biologically plausible these cognition models are, they are desirable
for their ability to allow introspection into the network’s workings and to help in understanding failures: in
the case of image captioning and generation, or emotion detection, the system’s
focus matches up with human intuition. The gates modulating attention in these
networks serve a dual purpose: first, they allow control of the information flow, and second, and
perhaps more crucially, they communicate problem structure by ensuring that specific groups
of neurons activate or go dormant jointly. For instance, when predicting from a sequence of
words, we expect certain words to be predictive while others are not; if this word sequence
is projected using an embedding matrix into word vectors, then by the same logic all the dimensions
of superfluous words’ vectors should be wiped out entirely.

Intuitively, this Occam’s Razor observation translates into requiring that the activation
of gating units be as sparse as possible when not all the words or information units are
necessary. The main focus of this paper is to show how to enforce sparsity on gating units by adding
an unsupervised training objective: the sum of the activations of the gating units $g_i$, weighted by a
hyper-parameter $\lambda_{\text{sparse}}$ that controls the tradeoff between the original objective function $J$ and the
sparsity criterion:

$$J^* = J + \lambda_{\text{sparse}} \sum_{i} g_i$$


In this work, we show that enforcing gate sparsity improves generalization in RNNs while also providing
useful visualisations of the problem, and we evaluate this approach on three different problems.

2 Related Work 

The work we are presenting is closely related to two areas of Machine Learning research: RNN 
regularization and attention-based models. 

2.1 RNN Regularization

RNN regularization has recently been shown to be achievable using Dropout, by regularizing
a subset of the recurrent connections in deep RNNs. Previously, it was shown that weight
decay regularization provided only small improvements, and that dropout noise was detrimental when
applied to all connections due to the compounding of errors over time. In this work, we show
that this problem can also be addressed with a deterministic approach: penalizing gate activations
in deep RNNs. As a result, RNNs can now benefit from multiple regularization techniques in
varying architectures.

2.2 Attention-Based Models 

In recent years, there has been a wealth of evidence that attention-based techniques can improve
the performance of machine learning models. Examples include work on capturing visual
structure through a sequence of glimpses over images, and networks that learn
how to attend to and control a separate memory.

In certain cases the models are trained with supervision on the gates; however, in many cases
there is no supervised data for the attentional component. Several surrogate objectives have been
suggested for learning where to focus, including setting a prior on observation spacing that makes
a tradeoff between exploration and exploitation, using reinforcement learning to optimize a
visual tracking strategy, or leaving this part semi-supervised through the primary objective. Our
work resembles the observation-spacing prior, in that we favor input gates being closed and penalize
deviation with a penalty of our choosing. Similarly to annealed Dropout, we also
consider a gradual increase in the sparsity penalty during training to encourage early exploration.


3 Problem Statement 

A powerful family of models, often called Encoder-Decoders, has opened many new possibilities
for sequence classification, including executing Python programs, drawing pictures,
machine translation, and syntactic parsing [22, 23]. The main problem we address
in this paper is improving generalization performance when performing these types of classical or
structured prediction tasks using RNNs. In the sections below we describe three different sequence
classification problems used to evaluate our approach.

3.1 Sentiment Analysis 

The central problem in sentiment analysis is correctly identifying and extracting the attitude or 
emotional tone of a speaker in the context of a particular topic or domain. 

Here we consider predicting the sentiment expressed in the Stanford Sentiment Treebank (SST)
[24], a collection of 11,855 sentences extracted from movie reviews. The dataset provides
sentiment annotations in 5 classes, {terrible, bad, neutral, good, terrific}, for the 215,154 unique
sub-phrases obtained after parsing each sentence using the Stanford Parser. In our work we do not
make use of the parse trees, and instead treat each sub-phrase as a labeled sequence of words.

3.2 Paraphrase Recognition 

In Paraphrase Recognition the problem is to predict how semantically similar two phrases are, on a scale from
0 to 1. This task can be seen either as regression or as binary classification, and performance is measured
as the Pearson correlation with human annotations or as the recall of correct paraphrase pairs.

Here we focus on paraphrase detection on the SemEval 2014 shared task 1 dataset, which includes
9927 sentence pairs in a 4500/500/4927 train/dev/test split. Each sentence pair is annotated with
a score $c \in [1, 5]$, with 5 indicating that the pair is a paraphrase and 1 that the pair is unrelated. We
additionally train using paraphrase pairs from the wikianswers paraphrase corpus [26].
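As a brief illustration of the evaluation measure mentioned above, the following sketch computes the Pearson correlation between model scores and human annotations. The function name `pearson` and the example scores are hypothetical and chosen only for illustration; they are not taken from the paper's code or data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical human similarity annotations in [1, 5] and model predictions.
human = [1.0, 2.5, 4.0, 5.0]
model = [1.2, 2.0, 3.9, 4.8]
r = pearson(human, model)
```

A value of `r` close to 1 indicates that the predicted similarities rank and scale the pairs in close agreement with the annotators.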

3.3 Question Answering 

Facebook AI Research recently proposed a set of 20 tasks designed to be “prerequisites” for any
system “capable of conversing with human” [27]. The dataset for each task is a set of stories, each
composed of many facts (some marked as relevant), a question, and the correct answer.

Daniel and Sandra journeyed to the office.
Then they went to the garden.
Sandra and John travelled to the kitchen.
After that they moved to the hallway.
Where is Daniel? A: garden

The football fits in the suitcase.
The suitcase fits in the cupboard.
The box of chocolates is smaller than the football.
Will the box of chocolates fit in the suitcase? A: yes


The tasks are synthetic and lack the noisy nature of real-world natural language processing, which makes them
easy to solve with hand-engineered systems; the open question, however, is how to create a model
capable of solving these tasks without any manual feature engineering for particular problems.


4 Approach 

In order to improve RNN performance on unseen data, we apply Occam’s Razor to our training data
by finding in each example a minimal set of useful inputs over time. To achieve this property we
apply gates to the different observations of the input sequence, allowing the network to keep or erase
a timestep’s input. For instance, in a sentiment classification problem, gates would ideally fire only
for emotionally loaded words, and stay dormant otherwise.

Because our approach relies on gates, we make the assumption that the vector input at each timestep
is an inseparable information unit, like a word, image, or fact. If this assumption holds, then when
we force the network to reduce its gate usage by penalizing the sum of those activations, we will
obtain a solution at a local optimum where gates are less often active, which should generalize better.

We formalise our approach by describing how we enforce sparsity on the gate activations for a 
variety of RNNs. Then we introduce the RNNs considered for the different tasks in this paper. 
Finally we explain the sparsity-enforcing objective function and our different annealing regimens 
during training. 

4.1 Gated LSTMs 

In our work we make extensive use of Long Short-Term Memory networks, a popular RNN
architecture specifically designed to capture long-range dependencies and alleviate training difficulties [29].
Since their introduction in 1995, many variants have been proposed [30]; however, for the
purposes of this research we found that the vanilla version from [30] worked best.


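Table 1: LSTM and Gated LSTM equations

description  | symbol             | LSTM                                     | Gated LSTM
-------------|--------------------|------------------------------------------|------------------------------------------
Occam’s gate | $g_{\text{occam}}$ | absent                                   | $f_{\text{gate}}(x_t, y_{t-1})$
gated input  | $x'_t$             | absent                                   | $x_t \cdot g_{\text{occam}}$
block input  | $z_t$              | $\tanh(W_z x_t + R_z y_{t-1} + b_z)$     | $\tanh(W_z x'_t + R_z y_{t-1} + b_z)$
input gate   | $i_t$              | $\sigma(W_i x_t + R_i y_{t-1} + b_i)$    | $\sigma(W_i x'_t + R_i y_{t-1} + b_i)$
forget gate  | $f_t$              | $\sigma(W_f x_t + R_f y_{t-1} + b_f)$    | $\sigma(W_f x'_t + R_f y_{t-1} + b_f)$
memory state | $m_t$              | $i_t \odot z_t + f_t \odot m_{t-1}$      | identical
output gate  | $o_t$              | $\sigma(W_o x_t + R_o y_{t-1} + b_o)$    | $\sigma(W_o x'_t + R_o y_{t-1} + b_o)$
hidden state | $y_t$              | $o_t \odot \tanh(m_t)$                   | identical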


While LSTMs are capable of selectively remembering or forgetting parts of their memory and input, they
lack the ability to transform their input uniformly. We extend LSTMs to include an additional gate,
$g_{\text{occam}}$, that multiplies all the inputs at a timestep by the same value. Table 1 presents the equations
for the Gated LSTM alongside those of the regular LSTM. We use the
following notation: $\sigma(\cdot)$ for the logistic sigmoid function, $W_{z,i,f,o}$ and $R_{z,i,f,o}$ for matrices,
and $b_{z,i,f,o}$ for vectors.
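To make the role of the extra gate concrete, here is a minimal sketch of a single Gated LSTM step in plain Python. It assumes list-based vectors and matrices and a hypothetical `params` dictionary keyed as `"Wz"`, `"Rz"`, `"bz"`, and so on; these names are illustrative and are not the layout of the released code. The gate value `g_occam` is passed in as a scalar, so any gating function can supply it.

```python
import math

def sigmoid(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def affine(W, x, R, y, b):
    """Compute W x + R y + b."""
    return [wx + ry + bi for wx, ry, bi in zip(matvec(W, x), matvec(R, y), b)]

def gated_lstm_step(x, y_prev, m_prev, params, g_occam):
    """One Gated LSTM step; g_occam is a scalar gate value in [0, 1]."""
    # The gate scales every input dimension by the same value.
    x_gated = [g_occam * xi for xi in x]
    pre = {
        k: affine(params["W" + k], x_gated, params["R" + k], y_prev, params["b" + k])
        for k in "zifo"
    }
    z = [math.tanh(a) for a in pre["z"]]   # block input
    i = [sigmoid(a) for a in pre["i"]]     # input gate
    f = [sigmoid(a) for a in pre["f"]]     # forget gate
    o = [sigmoid(a) for a in pre["o"]]     # output gate
    # Memory and hidden state updates are the same as in a regular LSTM.
    m = [it * zt + ft * mp for it, zt, ft, mp in zip(i, z, f, m_prev)]
    y = [ot * math.tanh(mt) for ot, mt in zip(o, m)]
    return y, m
```

With `g_occam = 0` the step depends only on the previous hidden and memory states, which is the sense in which a closed gate erases a timestep's input.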

The gating function $f_{\text{gate}}(\cdot)$ can take various forms. Two examples we consider are a linear function
of the input $x_t$ and a second-order gate capable of capturing higher-order interactions:

$$f_{\text{linear}}(x_t, h_{t-1}) = \sigma(w \cdot x_t + r \cdot h_{t-1} + b)$$

$$f_{\text{quad}}(x_t, h_{t-1}) = \sigma(h_{t-1}^{\top} \cdot W \cdot x_t + r \cdot h_{t-1} + b).$$

Additionally, we consider Gated Stacked LSTMs, a variant of Stacked LSTMs [31, 5, 20], where
the input to the lowest LSTM is gated using the hidden state from the topmost LSTM at the previous
timestep. The equation for this modification is as follows, with $l \in \{1, \ldots, l_{\max}\}$ the LSTM level:

$$g_{\text{occam}} = f_{\text{gate}}(x_t, y_{t-1}^{(l_{\max})}).$$
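The two gate forms can be sketched as follows. This is an illustration under stated assumptions: the parameter names `w`, `r`, `W`, and `b` are hypothetical, and the second-order term is written as a bilinear form between the previous hidden state and the input, which is one plausible reading of a second-order gate rather than a confirmed detail of the authors' implementation.

```python
import math

def sigmoid(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def gate_linear(x, h_prev, w, r, b):
    """Scalar gate from a linear function of the input and previous hidden state."""
    return sigmoid(dot(w, x) + dot(r, h_prev) + b)

def gate_quad(x, h_prev, W, r, b):
    """Scalar gate with an assumed bilinear interaction h_prev^T W x."""
    Wx = [dot(row, x) for row in W]  # W has len(h_prev) rows and len(x) columns
    return sigmoid(dot(h_prev, Wx) + dot(r, h_prev) + b)

def apply_gate(x, g):
    """Scale every input dimension by the same gate value."""
    return [g * xi for xi in x]
```

Because the gate is a single scalar per timestep, `apply_gate` either keeps or attenuates the whole input vector, matching the assumption that each timestep's input is an inseparable unit.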


4.2 Hierarchical Gated LSTMs 


Figure 1: Architecture for Hierarchical Gated-LSTMs


In this section we introduce the Hierarchical Gated LSTM (HG-LSTM), a gated attention model that
uses Gated LSTMs as a central building block. In the previous section we introduced Gated LSTMs
that are able to selectively ignore or include the entire input at a timestep. For many tasks, however,
the information presented can be subdivided into larger chunks such as sentences, paragraphs,
or episodes, and a similar gating procedure can be applied to these higher levels of abstraction. For
example, to find the answer to a question about a story in the bAbI dataset, such a model would benefit
from being selective about which words and facts to listen to.

HG-LSTM consists of two submodels: a Fact model and a High-Level model (HL model), which are
both Gated LSTMs. Figure 1 presents the architecture. Every word in a fact sequence is projected
using an embedding matrix and processed by the Fact model. The final hidden state of the Fact
model for each fact is then passed to the HL model as an input vector. We consider the final hidden
state of the HL model after reading each fact representation to be the high-level representation
of the entire sequence of facts. The hierarchy of the submodels explicitly leverages the problem
structure, and allows fine-grained attention control at two levels of abstraction.
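The two-level data flow described above can be sketched structurally as follows. The encoders are passed in as callables; the `mean_encoder` used here is only a toy stand-in for a Gated LSTM, and all names (`encode_story`, `embed`, `mean_encoder`) are hypothetical.

```python
def encode_story(story, embed, fact_encoder, high_level_encoder):
    """Encode a story given as a list of facts, each a list of word strings.

    fact_encoder maps a list of word vectors to one fact vector;
    high_level_encoder maps the list of fact vectors to a story vector.
    """
    fact_vectors = []
    for fact in story:
        word_vectors = [embed[w] for w in fact]          # embedding lookup
        fact_vectors.append(fact_encoder(word_vectors))  # Fact model
    return high_level_encoder(fact_vectors)              # High-Level model

def mean_encoder(vectors):
    """Toy stand-in for a Gated LSTM: average the input vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Hypothetical two-dimensional embeddings for a two-word vocabulary.
embed = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
story_vector = encode_story([["a", "b"], ["a"]], embed, mean_encoder, mean_encoder)
```

In the actual model each encoder would be a recurrent network with its own gates, so that sparsity can be encouraged separately at the word level and at the fact level.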


4.3 Sparsity Penalty 

The original training objective $J$ is augmented with the sparsity penalty, and the resulting objective
is optimized through gradient descent. The penalty is constructed by summing the activations of the
gates presented in Section 4.1 and weighting them by a parameter $\lambda_{\text{sparse}}$ chosen through hyperparameter
search:

$$J^* = J + \lambda_{\text{sparse}} \sum_{i=1}^{n} g_{\text{occam},i},$$

where the sum runs over all $n$ gate activations.
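In code, the augmented objective amounts to adding a weighted sum of gate activations to the task loss. The sketch below assumes the gate activations have already been collected into a flat list; the function name `sparse_objective` is hypothetical.

```python
def sparse_objective(task_loss, gate_activations, lambda_sparse):
    """Task loss plus the sparsity penalty on gate activations.

    Gate activations come from a sigmoid and are therefore non-negative,
    so their plain sum equals their L1 norm.
    """
    return task_loss + lambda_sparse * sum(gate_activations)

# Example: three gates, mostly closed, with a small penalty weight.
total = sparse_objective(1.0, [0.5, 0.25, 0.25], 0.1)
```

Setting `lambda_sparse` to zero recovers the original objective, which makes the unpenalized model a natural baseline.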


4.4 Training Regimens 

Our approach’s ultimate goal is to preserve network expressivity while making it robust against
changes in the input. However, forcing sparsity too soon can do more harm than good: a greedy
and locally optimal solution is to force all gates closed. To prevent this from happening we encourage
early exploration by progressively increasing the sparsity penalty $\lambda_{\text{sparse}}$. We investigated
two annealing regimens, a linear and a quadratic increase up to $\lambda_{\max}$ at training epoch $T_{\max}$, alongside a flat regimen,
as shown below with $e$ the training epoch:

$$\lambda_{\text{sparse}}(e) = \begin{cases}
\min\{\lambda_{\max} \, e / T_{\max},\ \lambda_{\max}\} & \text{linear regimen} \\
\min\{\lambda_{\max} \, (e / T_{\max})^2,\ \lambda_{\max}\} & \text{quadratic regimen} \\
\lambda_{\max} & \text{flat regimen}
\end{cases}$$
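A minimal sketch of the three regimens, assuming the penalty grows from zero at epoch 0 and is capped at its maximum from epoch `t_max` onward. The function name `sparsity_weight` and its argument names are hypothetical.

```python
def sparsity_weight(epoch, lambda_max, t_max, regimen="linear"):
    """Sparsity penalty weight at a given training epoch."""
    if regimen == "flat":
        return lambda_max
    # Fraction of the annealing period completed, capped at 1.
    frac = min(epoch / t_max, 1.0)
    if regimen == "linear":
        return lambda_max * frac
    if regimen == "quadratic":
        return lambda_max * frac ** 2
    raise ValueError("unknown regimen: %r" % (regimen,))

# Penalty weights over the first few epochs of a linear schedule.
schedule = [sparsity_weight(e, 0.2, 10, "linear") for e in range(4)]
```

The quadratic regimen keeps the penalty smaller than the linear one throughout the annealing period, leaving more room for early exploration before gates are pushed closed.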


5 Experiments 

The code needed to run the experiments in this paper is available online at
https://www.github.com/JonathanRaiman/Dali.¹

5.1 Sentiment Analysis 

For this problem our model is a Gated LSTM that reads each sequence of words sequentially and
uses the last hidden vector as input to a softmax linear classifier. Our target is to minimize the
Kullback-Leibler divergence with the correct label along with the sparsity penalty.

We project each word using an embedding matrix into a 100-dimensional vector, and keep only
the words that appear at least twice in our training data, with the remaining words replaced by
a special unknown word, <UNK>. We train 3 different models with hidden sizes 25, 50, and 150, and
apply Dropout [11, 12] with probability $p = 0.3$ to the non-recurrent connections of the LSTM. All
models are trained using Adadelta with $\rho = 0.95$, and we perform early stopping when the
accuracy stops increasing on the validation set.
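The vocabulary rule described above (keep words seen at least twice, map everything else to `<UNK>`) can be sketched as follows. The helper names `build_vocab` and `encode` are hypothetical, and the tiny corpus is made up for illustration.

```python
from collections import Counter

def build_vocab(sentences, min_count=2, unk="<UNK>"):
    """Map each word seen at least min_count times to an integer index."""
    counts = Counter(w for sentence in sentences for w in sentence)
    vocab = {unk: 0}  # index 0 is reserved for unknown words
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab, unk="<UNK>"):
    """Replace each word by its index, falling back to the unknown index."""
    return [vocab.get(w, vocab[unk]) for w in sentence]

train = [["good", "movie"], ["good", "plot"]]
vocab = build_vocab(train)
indices = encode(["good", "film"], vocab)
```

The resulting indices would then select rows of the embedding matrix before being fed to the Gated LSTM.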

5.2 Paraphrase Detection 

For paraphrase prediction we also employ Gated LSTMs, but with the final softmax layer removed:
instead of a softmax linear classifier, we use the last hidden state of each LSTM directly. Each
sentence in a pair is fed to a separate LSTM, and our objective is to minimize the squared difference
between the true similarity $t$ of the sentences and the dot product of the two LSTMs’ final hidden
states $h_1$ and $h_2$, that is, $(t - h_1 \cdot h_2)^2$.
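The paraphrase objective for a single sentence pair can be sketched as below, assuming the two final hidden states are given as plain lists of equal length. The function name `paraphrase_loss` is hypothetical, and any sparsity penalty would be added on top of this term.

```python
def paraphrase_loss(h1, h2, target):
    """Squared difference between the target similarity and the dot product
    of the two sentence representations."""
    if len(h1) != len(h2):
        raise ValueError("hidden states must have the same size")
    score = sum(a * b for a, b in zip(h1, h2))
    return (target - score) ** 2

# Dot product is 1*3 + 2*0.5 = 4, so the loss against a target of 5 is 1.
loss = paraphrase_loss([1.0, 2.0], [3.0, 0.5], 5.0)
```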


5.3 Facebook’s bAbI dataset 

For this problem we use an HG-LSTM to compute the high-level representation of each story. The
HG-LSTM takes a question, followed by the sequence of facts, and the final hidden state of the
HG-LSTM is fed as input to an LSTM decoder that produces the answer sequentially and ends its
prediction with an <EOS> symbol.

We use a separate Gated LSTM for the question and for the facts when creating representations for the High-Level
model in the HG-LSTM. To make the question influence the High-Level model’s input gates, we
average the embeddings of the words in the question and concatenate this average with the fact representation
and the current hidden state of the High-Level model.
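A small sketch of how the input to the High-Level gate could be assembled from the three pieces named above. The concatenation order and the function name `high_level_gate_input` are assumptions made for illustration.

```python
def high_level_gate_input(question_word_vectors, fact_vector, hl_hidden):
    """Concatenate the averaged question embedding, the fact representation,
    and the current High-Level hidden state (order is an assumption)."""
    n = len(question_word_vectors)
    question_avg = [sum(col) / n for col in zip(*question_word_vectors)]
    return question_avg + fact_vector + hl_hidden

gate_input = high_level_gate_input(
    [[1.0, 3.0], [3.0, 5.0]],  # embeddings of the question words
    [0.5],                      # fact representation
    [0.1, 0.2],                 # current High-Level hidden state
)
```

The resulting vector would be fed to the gating function, so that the same fact can be kept or discarded depending on the question being asked.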

Our error function is the sum of three separate objectives, $E_{\text{prediction}}$, $E_{\text{fact}}$, and $E_{\text{word}}$. The first two are:

$$E_{\text{prediction}} = \sum_{w \in Y} \sum_{w' \neq w} \max(\gamma - s(w) + s(w'), 0)$$

$$E_{\text{fact}} = -\sum_{i \in S} \log(g_i) - \mu_{\text{unsupporting}} \sum_{i \in F \setminus S} \log(1 - g_i)$$

The prediction error $E_{\text{prediction}}$ is defined as a margin loss on every word of the output, where $Y$ is the target
sequence of words, $s(w)$ is the score of a particular word, and $\gamma$ is the margin. We found that it significantly
decreases training time compared to cross-entropy error while achieving similar results.

¹ The project is currently under heavy development, do not hesitate to ask the authors for help!

For the fact selection error $E_{\text{fact}}$, a set of supporting facts $S$ is known; therefore, rather than using a sparsity
penalty, we use the cross-entropy error between the expected (1 for $i \in S$ and 0 otherwise) and actual gate
activations. $F$ is the set of fact indexes, and $g_i$ is the activation of the gate for fact $i$. The $\mu_{\text{unsupporting}}$ coefficient
was introduced because we reasoned that false negatives are potentially more harmful than false
positives for the network’s learning process.

Finally, $E_{\text{word}}$ is an L1 sparsity penalty over all the word gates in the Fact model, $E_{\text{word}} = \sum_{i \in F} \sum_{j} g_{i,j}$, where $g_{i,j}$ denotes the gate
activation for word $j$ in fact $i$.

We combine the errors into a single objective:

$$E = E_{\text{prediction}} + \lambda_{\text{fact}} E_{\text{fact}} + \lambda_{\text{word}} E_{\text{word}}$$
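The three terms and their combination can be sketched as follows. This is an illustration under assumptions: scores are given as one dictionary per output position, the fact term is written as a weighted binary cross-entropy, and all function and argument names are hypothetical rather than taken from the released code.

```python
import math

def margin_loss(step_scores, target_words, margin):
    """Margin loss summed over output positions and over all wrong words.

    step_scores: one dict per output position mapping word -> score.
    """
    total = 0.0
    for scores, target in zip(step_scores, target_words):
        for word, score in scores.items():
            if word != target:
                total += max(margin - scores[target] + score, 0.0)
    return total

def fact_loss(fact_gates, supporting, mu_unsupporting, eps=1e-12):
    """Cross-entropy between fact gate activations and supporting-fact labels,
    with the unsupporting term scaled by mu_unsupporting."""
    total = 0.0
    for i, g in enumerate(fact_gates):
        g = min(max(g, eps), 1.0 - eps)  # avoid log(0)
        if i in supporting:
            total -= math.log(g)
        else:
            total -= mu_unsupporting * math.log(1.0 - g)
    return total

def word_loss(word_gates):
    """L1 penalty: sum of word gate activations over all facts."""
    return sum(sum(gates) for gates in word_gates)

def total_loss(e_pred, e_fact, e_word, lambda_fact, lambda_word):
    """Weighted combination of the three objectives."""
    return e_pred + lambda_fact * e_fact + lambda_word * e_word
```

Setting `lambda_fact` or `lambda_word` to zero switches off the corresponding penalty, which corresponds to the ablations compared in the results.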

Our precise parameters for the experiment were as follows: all word embeddings have 50 dimensions,
and we used Dropout with $p = 0.5$ in the High-Level model and $p = 0.3$ for the Question and Fact
models. The Fact model has a hidden size of 30, while the High-Level model is a Gated Stacked
LSTM with 6 layers and a hidden size of 20. All the gates used are second order, $f_{\text{quad}}(\cdot)$.

We use the first 1000 examples for training, as suggested in [27], and reserve 20% for validation.
Our model is trained using AdaDelta with $\rho = 0.95$ and a minibatch size of 50. We perform
early stopping when the validation score stops increasing.


6 Results 

6.1 Effects on performance 

Occam’s gates improve generalization on sentiment analysis (Fig. 2), paraphrase prediction (Fig. 4),
and the majority of bAbI question answering problems (Fig. 6, Table 2). This effect is especially
visible as model size increases (Fig. 2, Fig. 4). We find that without a sparsity penalty, increasing
model size has a smaller effect; with sparsity, however, we achieve a 5% improvement on
sentiment analysis and an 18% improvement on paraphrase prediction recall. Additionally, for the three arg relations
bAbI problem it increases the accuracy by 14%. We observe greater improvements on this task than
the other two; notably, this task has longer sentences, and thus word gating is more present.

Moreover, the sparsity annealing methods described in Section 4.4 show improvements over a static
objective function (Fig. 3, Fig. 5). In particular, the linear regimen improves the result by 1% for
sentiment analysis, and by 7% for recall on paraphrase prediction.

Finally, we observed that the HG-LSTM model significantly improves performance over the LSTM
baseline from [27]. As visible in Table 2, this model improves scores on 17 out of 20 problems.
Moreover, HG-LSTMs with no penalties, $\lambda_{\text{word}} = \lambda_{\text{fact}} = 0$, yield worse results than those with
penalties for the majority of the problems (17 out of 20 tasks). Our best results are achieved by
using a mixture of both fact detection penalisation and word sparsity (7 out of 20 tasks). The HG-LSTM
performs worse than Memory Networks (MemNN); however, our model appears to be less
computationally costly since we do not require branch and bound search to select supporting facts.




Figure 2: SST Root Accuracy with varying LSTM hidden size and sparsity penalty $\lambda$.

Figure 3: Effect of sparsity regimen and sparsity penalty $\lambda$ on SST Accuracy.


6.2 Interpretability 

The ability to interpret the calculations carried out by Machine Learning models is crucial for advancing
research. For Neural Network models especially, there are no well-established methods for understanding
their capabilities, although attempts have been made, e.g. Hinton Diagrams. We claim
that Occam’s gates provide some insight into the way a network operates on its hidden state.

Figure 4: Paraphrase accuracy with varying LSTM hidden size and sparsity penalty $\lambda$.

Figure 5: Effect of sparsity regimen and penalty $\lambda$ on Paraphrase prediction.

Figure 6: Accuracy for three bAbI tasks with varying $\lambda_{\text{word}}$.

Figure 7: Effect of $\lambda_{\text{word}}$ and $\lambda_{\text{fact}}$ on Basic Induction task accuracy.


6.2.1 Error analysis 

Diagnosing and identifying the root cause of errors during model design is critical for finding
new research directions and making improvements. We believe using Occam’s gates can help researchers
gain insight into their network’s workings. To support this claim, let us consider an example
from the bAbI dataset where gates provide a visual indication of progress.


Figure 8: Example story from the single supporting fact bAbI task. Activation of word gates is shown
with yellow highlighter. Text opacity reflects the activation of the fact gate for the sentence. The
images were taken when validation accuracy was 20%, 60% and 100% (left to right).


In Figure 8, we notice that the model, upon reaching a validation accuracy of 20%, is not yet capable of
distinguishing important information from noise. At 60% accuracy it can highlight the relevant
facts, but the gates on words are not yet compelling. At 100% accuracy, fact and word gates work
in unison: the network activates for the fact with the relevant person and for words that contain location
information. We hypothesize that LSTMs without gates can pick out the correct person and place,
but Occam’s gates help them ignore facts about persons irrelevant to the question.


6.2.2 Relevancy detection 

We argue that Occam’s gates allow one to judge which pieces of information are relevant to a
problem. To illustrate this claim we show two examples, both of which emerged when training
the system on a paraphrase detection problem with a Character model Gated LSTM (Char Gated
LSTM). Figure 9 supports the belief that the model makes use of word boundaries, and Figure 10
suggests that the network can ignore repetitive or superfluous characters.

Table 2: Comparison of test accuracy on the bAbI dataset with different models. Models
are (left to right): LSTM baseline from [27], followed by HG-LSTM with: no penalty, word sparsity
penalty only, fact selection penalty only, and both. The last column is MemNN.

task                   | LSTM | No penalty | Word penalty | Fact penalty | Fact, word | MemNN
-----------------------|------|------------|--------------|--------------|------------|------
single supporting fact |   50 |         81 |           45 |          100 |         99 |   100
two supporting facts   |   20 |         32 |           19 |           30 |         32 |   100
three supporting facts |   20 |         19 |           20 |           16 |         20 |   100
two arg relations      |   61 |         76 |           65 |           76 |         77 |   100
three arg relations    |   70 |         51 |           66 |           40 |         31 |    98
yes-no questions       |   48 |         48 |           51 |           50 |         50 |   100
counting               |   49 |         76 |           65 |           69 |         70 |    85
lists sets             |   45 |         78 |           66 |           76 |         73 |    91
simple negation        |   64 |         67 |           65 |           70 |         69 |   100
indefinite knowledge   |   44 |         45 |           47 |           40 |         44 |    98
basic-coreference      |   72 |         87 |           50 |           88 |         89 |   100
conjunction            |   74 |         75 |           66 |           99 |         99 |   100
compound-coreference   |   94 |         73 |           93 |           91 |         86 |   100
time reasoning         |   27 |         27 |           19 |           18 |         18 |    99
basic deduction        |   21 |         39 |           50 |           24 |         50 |   100
basic induction        |   23 |         44 |           42 |           47 |         40 |   100
positional reasoning   |   51 |         52 |           52 |           52 |         58 |    65
size reasoning         |   52 |         54 |           90 |           89 |         50 |    95
path finding           |    8 |          8 |            8 |            8 |          8 |    36
agents motivations     |   91 |         95 |           63 |           66 |         96 |   100


The Atlanta Falcons have pick Desmond Trufant in the 2
we got Desmond trufant from Washington

Figure 9: Char Gated LSTM, gate action shown with yellow highlighter. Model discovers tokenisation.


Hmmmmmmmmmmm Jeremy Lin out again in the playoffs
Smh at Jeremy Lin talkin bout a dude fallin off

Figure 10: Char Gated LSTM, gate action shown with yellow highlighter. Model focuses
on upper case characters and ignores repeats.


7 Conclusion 

In this paper, we investigated the use of a complementary objective function that forces attention-based
RNNs to be selective about their inputs. We showed on three different tasks that our approach
improves generalization and interpretability of the trained models with respect to their counterparts that
do not use sparsity penalties. Additionally, to encourage early exploration and preserve sparsity, we
designed an annealing objective function that provides benefits over a standard one.

Finally, we introduced the Hierarchical Gated LSTM, a new model that performs significantly better
than regular Stacked LSTMs; this network combines attentional and hierarchical components, and
reasons at several levels of abstraction. Future work includes investigation of this model family,
which shows promise towards advancing the state of the art.